Take-home Exercise 1 Part I - Creating Data Visualisation Beyond Default

Author

Alicia Loh

Modified

April 19, 2024

Context

There are two major residential property market in Singapore, namely public and private housing. Public housing aims to meet the basic need of the general public with monthly household income less than or equal to S$14,000. For families with monthly household income more than S$14,000, they need to turn to the private residential market.

The Task

Assuming the role of a graphical editor of a median company, a minimum two and maximum three of data visualisations are prepared to reveal the private residential market and sub-markets of Singapore for the 1st quarter of 2024.

The Data

To accomplish the task, transaction data of REALIS will be used.

Downloading the Dataset

  1. Access Dataset via SMU e-library

  2. After logging in with SMU credentials, navigate to “Residential” tab

  3. Under Property Types, “Select All”

  4. Under Sale Date, select “2023 Jan” - “2024 Mar”

  5. Click “Search”

  6. Click “Download”

  7. Due to the size of the dataset, it is split into multiple segments. Download all in .csv format

The Designing Tool

The data will be processed using the appropriate tidyverse family of packages and the statistical graphics will be prepared using ggplot2 and its extensions.

Getting Started

Installing and loading the required libraries

Note: Ensure that the pacman package has already been installed.

The code chunk below uses p_load() of pacman package to check if the listed packages are installed in the computer. If they are, then they will be launched into R. Otherwise, tidyverse will be installed and launched into R.

  • tidyverse: (i.e. readr, tidyr, dplyr, ggplot2, lubridate) for performing data science tasks such as importing, tidying, and wrangling data, as well as creating graphics based on The Grammar of Graphics

  • reshape2 for transforming data between wide and long formats

  • ggridges for creating ridgeline plots

  • ggdist for visualising distributions and uncertainty

  • ggrepel: provides geoms for ggplot2 to repel overlapping text labels.

  • ggthemes: provides some extra themes, geoms, and scales for ‘ggplot2’.

  • hrbrthemes: provides typography-centric themes and theme components for ggplot2.

  • patchwork: preparing composite figure created using ggplot2

pacman::p_load(tidyverse, reshape2, ggridges, ggdist,
               ggrepel, ggthemes, hrbrthemes, patchwork)

Importing the Data

  • The data has been split into 5 .csv files

  • Define the path to the directory containing the CSV files.

  • Use list.files() to list all CSV files in the specified directory.

  • Loop through each CSV file, read it into a data frame using read_csv(), and store it in a list.

  • Use bind_rows() to combine all data frames in the list into a single big data frame.

csv_directory <- "data/"
csv_files <- list.files(csv_directory, pattern = "\\.csv$", full.names = TRUE)

realis <- list()

for (file in csv_files) {
  realis[[file]] <- read_csv(file)
}
realis_all <- bind_rows(realis)

View Data

  • Use the names() function to print the names of the columns in the tibble data frame.
  • Use the glimpse() function to get a quick overview of the tibble data frame
col_names <- names(realis_all)
col_names
 [1] "Project Name"                "Transacted Price ($)"       
 [3] "Area (SQFT)"                 "Unit Price ($ PSF)"         
 [5] "Sale Date"                   "Address"                    
 [7] "Type of Sale"                "Type of Area"               
 [9] "Area (SQM)"                  "Unit Price ($ PSM)"         
[11] "Nett Price($)"               "Property Type"              
[13] "Number of Units"             "Tenure"                     
[15] "Completion Date"             "Purchaser Address Indicator"
[17] "Postal Code"                 "Postal District"            
[19] "Postal Sector"               "Planning Region"            
[21] "Planning Area"              
glimpse(realis_all)
Rows: 26,806
Columns: 21
$ `Project Name`                <chr> "THE REEF AT KING'S DOCK", "URBAN TREASU…
$ `Transacted Price ($)`        <dbl> 2317000, 1823500, 1421112, 1258112, 1280…
$ `Area (SQFT)`                 <dbl> 882.65, 882.65, 1076.40, 1033.34, 871.88…
$ `Unit Price ($ PSF)`          <dbl> 2625, 2066, 1320, 1218, 1468, 1767, 1095…
$ `Sale Date`                   <chr> "01 Jan 2023", "02 Jan 2023", "02 Jan 20…
$ Address                       <chr> "12 HARBOURFRONT AVENUE #05-32", "205 JA…
$ `Type of Sale`                <chr> "New Sale", "New Sale", "New Sale", "New…
$ `Type of Area`                <chr> "Strata", "Strata", "Strata", "Strata", …
$ `Area (SQM)`                  <dbl> 82.0, 82.0, 100.0, 96.0, 81.0, 308.7, 42…
$ `Unit Price ($ PSM)`          <dbl> 28256, 22238, 14211, 13105, 15802, 19015…
$ `Nett Price($)`               <chr> "-", "-", "-", "-", "-", "-", "-", "-", …
$ `Property Type`               <chr> "Condominium", "Condominium", "Executive…
$ `Number of Units`             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ Tenure                        <chr> "99 yrs from 12/01/2021", "Freehold", "9…
$ `Completion Date`             <chr> "Uncompleted", "Uncompleted", "Uncomplet…
$ `Purchaser Address Indicator` <chr> "HDB", "Private", "HDB", "HDB", "HDB", "…
$ `Postal Code`                 <chr> "097996", "419535", "269343", "269294", …
$ `Postal District`             <chr> "04", "14", "27", "27", "28", "19", "10"…
$ `Postal Sector`               <chr> "09", "41", "26", "26", "79", "54", "27"…
$ `Planning Region`             <chr> "Central Region", "East Region", "North …
$ `Planning Area`               <chr> "Bukit Merah", "Bedok", "Yishun", "Yishu…
Note

realis_all contains:

  • Public and Private residential property transaction data from 1st January 2023 to 31st March 2024.

  • There are 26806 rows and 21 columns.

Data Preparation

The task only requires data from the private residential market and sub-markets of Singapore for the 1st quarter of 2024.

Standardise Date Column Format

The “Sales Date” column is currently a cha type. It needs to be converted into date format.

dmy() is a function from the lubridate package that converts character strings to date format in the day-month-year (DMY) order.

Standardise Date Format and verify column type after standardisation

realis_all$`Sale Date` <- dmy(realis_all$`Sale Date`)
class(realis_all$`Sale Date`)
[1] "Date"

Check data

head(realis_all$`Sale Date`)
[1] "2023-01-01" "2023-01-02" "2023-01-02" "2023-01-02" "2023-01-03"
[6] "2023-01-03"

Keep only relevant rows

Filter and keep only rows that:

  • Sales Date that occur within Q1 2024 i.e. between 01 Jan 2024 to 31 Mar 2024 inclusive.

  • Purchaser Address Indicator not equal to “HDB”

In addition, any duplicate or empty rows are also removed.

q1_pte_raw <- realis_all %>%
    filter(`Sale Date` >= as.Date("2024-01-01") & 
          `Sale Date` <= as.Date("2024-03-31"),
          `Purchaser Address Indicator` != "HDB") %>%
    distinct() %>%
    drop_na()

View Data

glimpse(q1_pte_raw)
Rows: 3,567
Columns: 21
$ `Project Name`                <chr> "THE LANDMARK", "POLLEN COLLECTION", "TE…
$ `Transacted Price ($)`        <dbl> 2726888, 3850000, 2190000, 1954000, 3412…
$ `Area (SQFT)`                 <dbl> 1076.40, 1808.35, 807.30, 796.54, 1323.9…
$ `Unit Price ($ PSF)`          <dbl> 2533, 2129, 2713, 2453, 2577, 838, 2007,…
$ `Sale Date`                   <date> 2024-01-01, 2024-01-01, 2024-01-01, 202…
$ Address                       <chr> "173 CHIN SWEE ROAD #22-11", "34 POLLEN …
$ `Type of Sale`                <chr> "New Sale", "New Sale", "New Sale", "New…
$ `Type of Area`                <chr> "Strata", "Land", "Strata", "Strata", "S…
$ `Area (SQM)`                  <dbl> 100.0, 168.0, 75.0, 74.0, 123.0, 328.0, …
$ `Unit Price ($ PSM)`          <dbl> 27269, 22917, 29200, 26405, 27741, 9024,…
$ `Nett Price($)`               <chr> "-", "-", "-", "-", "-", "-", "-", "-", …
$ `Property Type`               <chr> "Condominium", "Terrace House", "Apartme…
$ `Number of Units`             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ Tenure                        <chr> "99 yrs from 28/08/2020", "99 yrs from 0…
$ `Completion Date`             <chr> "Uncompleted", "Uncompleted", "Uncomplet…
$ `Purchaser Address Indicator` <chr> "Private", "N.A", "N.A", "Private", "Pri…
$ `Postal Code`                 <chr> "169878", "807233", "118992", "598444", …
$ `Postal District`             <chr> "03", "28", "05", "21", "21", "28", "20"…
$ `Postal Sector`               <chr> "16", "80", "11", "59", "58", "79", "57"…
$ `Planning Region`             <chr> "Central Region", "North East Region", "…
$ `Planning Area`               <chr> "Outram", "Serangoon", "Queenstown", "Bu…
Note

q1_pte_raw contains:

  • Private residential property transaction data from 1st January 2024 to 31st March 2024

  • There are 3567 rows and 21 columns.

Keep only relevant columns

Not all 21 columns will be used for analysis e.g. irrelevant, contains overlapping information as another column. Only relevant columns will be kept.

Columns to drop:

  • Type of Area: Not used in analysis

  • Area (SQM): Similar information as Area (SQFT)

  • Unit Price ($ PSM): Similar information as Unit Price ($ PSF)

  • Nett Price ($): Similar information as Transacted Price ($)

  • Purchaser Address Indicator: Not used in analysis

  • Postal District and Postal Sector: Overlapping information as Postal Code

The select() function is used to choose the columns to keep. However, by prefixing the column names with a minus sign (-), the function will drop the specified columns instead.

q1_pte <- q1_pte_raw %>%
    select(
        -`Type of Area`,
        -`Area (SQM)`,
        -`Unit Price ($ PSM)`,
        -`Nett Price($)`,
        -`Purchaser Address Indicator`,
        -`Postal District`,
        -`Postal Sector`
    )

View Data

glimpse(q1_pte)
Rows: 3,567
Columns: 14
$ `Project Name`         <chr> "THE LANDMARK", "POLLEN COLLECTION", "TERRA HIL…
$ `Transacted Price ($)` <dbl> 2726888, 3850000, 2190000, 1954000, 3412201, 29…
$ `Area (SQFT)`          <dbl> 1076.40, 1808.35, 807.30, 796.54, 1323.97, 3530…
$ `Unit Price ($ PSF)`   <dbl> 2533, 2129, 2713, 2453, 2577, 838, 2007, 1756, …
$ `Sale Date`            <date> 2024-01-01, 2024-01-01, 2024-01-01, 2024-01-01…
$ Address                <chr> "173 CHIN SWEE ROAD #22-11", "34 POLLEN PLACE",…
$ `Type of Sale`         <chr> "New Sale", "New Sale", "New Sale", "New Sale",…
$ `Property Type`        <chr> "Condominium", "Terrace House", "Apartment", "C…
$ `Number of Units`      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ Tenure                 <chr> "99 yrs from 28/08/2020", "99 yrs from 09/12/20…
$ `Completion Date`      <chr> "Uncompleted", "Uncompleted", "Uncompleted", "U…
$ `Postal Code`          <chr> "169878", "807233", "118992", "598444", "589605…
$ `Planning Region`      <chr> "Central Region", "North East Region", "Central…
$ `Planning Area`        <chr> "Outram", "Serangoon", "Queenstown", "Bukit Tim…
Note

q1_pte contains:

  • Private residential property transaction data from 1st January 2024 to 31st March 2024

  • There are 3567 rows and 14 columns.

Visualisation

There are various types of Properties for Private residences.

unique(q1_pte$`Property Type`)
[1] "Condominium"           "Terrace House"         "Apartment"            
[4] "Executive Condominium" "Semi-Detached House"   "Detached House"       

The different types in the dataset are:

  • Condominium

  • Terrace House

  • Apartment

  • Executive Condominium

  • Semi-Detached House

  • Detached House

Price Distribution

Create box plots for the price distribution of each private residence property type. geom_boxplot() displays continuous value list. It visualises five summary statistics (the median, two hinges and two whiskers), and all “outlying” points individually.

Note: For better visibility, labels parameter with the function scales::number, formats the y-axis labels to include thousands separators.

box_plot <- ggplot(data = q1_pte, 
                   aes(x = `Property Type`, y = `Transacted Price ($)`)) + 
            geom_boxplot() +
            labs(title = "Box Plot of Transacted Price ($) by Property Type",
                x = "Property Type",
                y = "Transacted Price ($)") +
            scale_y_continuous(labels = scales::number) + 
            theme_gray() + theme(axis.text.x = element_text(angle = 45, 
                                                            hjust = 1,
                                                            vjust=1))
box_plot
Tip

There are six private property types in the dataset.

The boxplot shows Detached Houses generally have the highest transacted prices. The Q1, median, and Q3 values for this property type are all substantially higher than those of other types, indicating its premium market status. Detached Houses also exhibit the highest variability in transacted prices, although there are relatively few outliers.

In contrast, both Executive Condominiums and Terrace Houses demonstrate the least variability in transacted prices, with their interquartile ranges (IQR) closely aligning with their medians. This suggests a stable and consistent pricing trend for these property types.

Executive Condominiums have the lowest transacted prices, indicating their appeal as an affordable option within the private housing market.

Apartments and Condominiums, on the other hand, show a significant number of outliers in their transacted prices, suggesting a wider range of pricing and potentially more diversity in market conditions for these property types.

Geographical Distribution

A choropleth map is used to visualize property sales across different districts or areas in Singapore.

Two data sets will be used to create the choropleth map. They are:

  • MPSZ-2019: This data provides the sub-zone boundary of URA Master Plan 2019. It can be downloaded at data.gov.sg It consists of the geographical boundary of Singapore at the planning subzone level. The data is based on URA Master Plan 2019.

  • Private residential property transaction data from 1st January 2024 to 31st March 2024 in tibble data frame (i.e. q1_pte).

The code chunk below loads the following packages:

  • tmap: for thematic mapping

  • sf: for geospatial data handling

  • httr:

pacman::p_load(tmap,sf,httr, purrr, future, furrr)

Geospatial Data

Geocoding using Singapore Land Authority (SLA) API

Geocoding - provide geographical coordinates corresponding to a location.

Parallel processing is set up to speed up the process. Results are saved into a .csv file for easy future access.

plan(multisession)

url <- "https://www.onemap.gov.sg/api/common/elastic/search"

postcodes <- unique(q1_pte$`Postal Code`)

# Function to fetch data for a single postal code
fetch_postcode_data <- function(postcode) {
    query <- list('searchVal' = postcode, 'returnGeom' = 'Y', 'getAddrDetails' = 'Y', 'pageNum' = '1')
    res <- GET(url, query = query)
    
    if (http_error(res)) {
        return(NULL) 
    }
    
    content_res <- content(res)
    if (content_res$found != 0) {
        return(data.frame(content_res)[4:13])
    } else {
        return(data.frame(postcode = postcode))
    }
}

results <- future_map(postcodes, fetch_postcode_data)

found <- bind_rows(results, .id = "postcode")

found <- found %>%
    filter(!is.na(postcode))

write.csv(found, file = "data/aspatial/found.csv")

Import Geospatial Data

The code chunk below uses the st_read() function of sf package to import MPSZ-2019 shapefile into R as a simple feature data frame called mpsz.

mpsz <- st_read(dsn = "data/geospatial",
                layer = "MPSZ-2019") %>%
  st_transform(crs = 3414)
Reading layer `MPSZ-2019' from data source 
  `C:\lnealicia\ISSS608\Take-home_Ex\Take-home_Ex01\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS:  WGS 84

Add coordinates

q1_pte_coor <- q1_pte %>%
    left_join(
        found %>% select(results.POSTAL, results.LATITUDE, results.LONGITUDE),
        by = c("Postal Code" = "results.POSTAL")
    ) %>%
    rename(
        Latitude = results.LATITUDE,
        Longitude = results.LONGITUDE
    ) %>%
   filter(!is.na(Longitude) & !is.na(Latitude))

Convert to a simple features tibble data frame

q1_pte_sf <- st_as_sf(q1_pte_coor,
                       coords = c("Longitude", "Latitude"),
                       crs =4326) %>%
  st_transform(crs = 3414)

Keep relevant columns

q1_pte_sf_plot <- q1_pte_sf %>%
    select(-`Project Name`,
           -`Sale Date`,
           -`Address`,
           -`Type of Sale`,
           -`Tenure`,
           -`Completion Date`)

Average Transacted Price by Planning Area

avg_txn_px <- q1_pte_sf_plot %>%
    group_by(`Planning Area`) %>%
    summarize(
        Avg_Transacted_Price = mean(`Transacted Price ($)`, na.rm = TRUE)
    )

avg_txn_px <- avg_txn_px %>%
    mutate(`Planning Area` = toupper(`Planning Area`))

avg_txn_px <- st_drop_geometry(avg_txn_px)

Combine avg_txn_px and mpsz

Populates the average transacted price of each planning area into mpsz sf data frame. Not every planning area may have transaction data, so empty rows are dropped.

mpsz_avg_txn_px <- mpsz %>%
    left_join(
        avg_txn_px,
        by = c("PLN_AREA_N" = "Planning Area")
    ) %>%
  drop_na()

Plotting a point simple feature layer

The overall plot shows both the average transacted price of the planning area and the distribution of private residences. If viewing the data separately is preferred, click on the relevant tabs.

tmap_mode("view")

map <- tm_shape(mpsz_avg_txn_px) +
    tm_polygons(col = "Avg_Transacted_Price", 
                palette = "Blues", 
                alpha = 0.4,
                style = "quantile") +
    tmap_options(check.and.fix = TRUE) +
    
    tm_shape(q1_pte_sf_plot) +
    tm_dots(col = "Property Type") +

  tm_view(set.zoom.limits = c(11,14))

map
tmap_mode("plot")
tmap_mode("view")

map2 <- tm_shape(mpsz_avg_txn_px) +
    tm_polygons(col = "Avg_Transacted_Price", 
                palette = "Blues", 
                alpha = 0.5,
                style = "quantile") +
    tmap_options(check.and.fix = TRUE) +
  tm_view(set.zoom.limits = c(11,14))

map2
tmap_mode("plot")
tmap_mode("view")

map3 <- tm_shape(mpsz_avg_txn_px) +
    tm_polygons() +
    tm_shape(q1_pte_sf_plot) +
    tm_dots(col = "Property Type") +

  tm_view(set.zoom.limits = c(11,14))

map3
tmap_mode("plot")
Note

tmap_mode(“view”):

  • Designed for interactive viewing of spatial data.

  • Enables zooming, panning, and other interactive features, making it easier to explore the data in detail.

tmap_mode(“plot”):

  • Used for static plotting of spatial data that can be saved as static images e.g., PNG

  • Does not support interactions like zooming and panning.

Setting the mode to plot after the map is generated saves on resources and allows the code to run faster, as the plot does not require continuous updating.

Tip

Using the overall plot,